Tidy Tuesday Exercise

Introduction

For this exercise, I will participate in the TidyTuesday data exploration and analysis. The data is gathered from the TidyTuesday Github repository and loaded as a package. I will wrangle and explore the data. The conduct model fitting and finally test the data.

Set Up

First, I will start by loading all necessary packages and the data.

Load Packages

library(broom)
library(ggplot2)
library(here)
here() starts at /Users/mutsa_n/Desktop/MADA-course/mutsanyamuranga-MADA-portfolio
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(purrr)
library(tidyr)
library(base)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ lubridate 1.9.3     ✔ tibble    3.2.1
✔ readr     2.1.5     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(jsonlite)

Attaching package: 'jsonlite'

The following object is masked from 'package:purrr':

    flatten
library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
library(here)
library(fs)
library(lubridate)
library(dplyr)
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
✔ dials        1.2.1      ✔ rsample      1.2.0 
✔ infer        1.0.6      ✔ tune         1.1.2 
✔ modeldata    1.3.0      ✔ workflows    1.1.4 
✔ parsnip      1.2.0      ✔ workflowsets 1.0.1 
✔ recipes      1.0.10     ✔ yardstick    1.2.0 
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard()   masks purrr::discard()
✖ dplyr::filter()     masks stats::filter()
✖ recipes::fixed()    masks stringr::fixed()
✖ jsonlite::flatten() masks purrr::flatten()
✖ dplyr::lag()        masks stats::lag()
✖ yardstick::spec()   masks readr::spec()
✖ recipes::step()     masks stats::step()
• Use tidymodels_prefer() to resolve common conflicts.
library(ranger)
library(yardstick)
library(leaflet)
library(maps)

Attaching package: 'maps'

The following object is masked from 'package:purrr':

    map

Load Data

library(tidytuesdayR)

tuesdata <- tidytuesdayR::tt_load(2024, week = 15)

    Downloading file 1 of 4: `eclipse_annular_2023.csv`
    Downloading file 2 of 4: `eclipse_total_2024.csv`
    Downloading file 3 of 4: `eclipse_partial_2023.csv`
    Downloading file 4 of 4: `eclipse_partial_2024.csv`
eclipse_annular_2023 <- tuesdata$eclipse_annular_2023
eclipse_total_2024 <- tuesdata$eclipse_total_2024
eclipse_partial_2023 <- tuesdata$eclipse_partial_2023
eclipse_partial_2024 <- tuesdata$eclipse_partial_2024

eclipse_annular_2023 <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-04-09/eclipse_annular_2023.csv')
eclipse_total_2024 <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-04-09/eclipse_total_2024.csv')
eclipse_partial_2023 <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-04-09/eclipse_partial_2023.csv')
eclipse_partial_2024 <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-04-09/eclipse_partial_2024.csv')

Exploration

Next, I will conduct some exploratory data anaylsis using graphs, plots and tables. To start, we have to understand what the data is and what it is tracking.

Data Structure

# Look at full summary of each data set
summary(eclipse_annular_2023)
    state               name                lat             lon         
 Length:811         Length:811         Min.   :27.22   Min.   :-124.45  
 Class :character   Class :character   1st Qu.:31.30   1st Qu.:-111.98  
 Mode  :character   Mode  :character   Median :35.42   Median :-106.70  
                                       Mean   :35.41   Mean   :-108.05  
                                       3rd Qu.:38.42   3rd Qu.:-101.36  
                                       Max.   :44.87   Max.   : -96.72  
  eclipse_1         eclipse_2         eclipse_3         eclipse_4       
 Length:811        Length:811        Length:811        Length:811       
 Class1:hms        Class1:hms        Class1:hms        Class1:hms       
 Class2:difftime   Class2:difftime   Class2:difftime   Class2:difftime  
 Mode  :numeric    Mode  :numeric    Mode  :numeric    Mode  :numeric   
                                                                        
                                                                        
  eclipse_5         eclipse_6       
 Length:811        Length:811       
 Class1:hms        Class1:hms       
 Class2:difftime   Class2:difftime  
 Mode  :numeric    Mode  :numeric   
                                    
                                    
summary(eclipse_partial_2023)
    state               name                lat             lon         
 Length:31363       Length:31363       Min.   :17.96   Min.   :-176.60  
 Class :character   Class :character   1st Qu.:35.36   1st Qu.: -97.50  
 Mode  :character   Mode  :character   Median :39.56   Median : -89.26  
                                       Mean   :38.80   Mean   : -91.97  
                                       3rd Qu.:41.93   3rd Qu.: -81.14  
                                       Max.   :71.25   Max.   : 174.11  
  eclipse_1         eclipse_2         eclipse_3         eclipse_4       
 Length:31363      Length:31363      Length:31363      Length:31363     
 Class1:hms        Class1:hms        Class1:hms        Class1:hms       
 Class2:difftime   Class2:difftime   Class2:difftime   Class2:difftime  
 Mode  :numeric    Mode  :numeric    Mode  :numeric    Mode  :numeric   
                                                                        
                                                                        
  eclipse_5       
 Length:31363     
 Class1:hms       
 Class2:difftime  
 Mode  :numeric   
                  
                  
summary(eclipse_partial_2024)
    state               name                lat             lon         
 Length:28844       Length:28844       Min.   :17.96   Min.   :-176.60  
 Class :character   Class :character   1st Qu.:35.24   1st Qu.: -99.08  
 Mode  :character   Mode  :character   Median :39.52   Median : -90.30  
                                       Mean   :38.76   Mean   : -93.00  
                                       3rd Qu.:42.04   3rd Qu.: -81.16  
                                       Max.   :71.25   Max.   : 174.11  
  eclipse_1         eclipse_2         eclipse_3         eclipse_4       
 Length:28844      Length:28844      Length:28844      Length:28844     
 Class1:hms        Class1:hms        Class1:hms        Class1:hms       
 Class2:difftime   Class2:difftime   Class2:difftime   Class2:difftime  
 Mode  :numeric    Mode  :numeric    Mode  :numeric    Mode  :numeric   
                                                                        
                                                                        
  eclipse_5       
 Length:28844     
 Class1:hms       
 Class2:difftime  
 Mode  :numeric   
                  
                  
summary(eclipse_total_2024)
    state               name                lat             lon         
 Length:3330        Length:3330        Min.   :28.45   Min.   :-101.16  
 Class :character   Class :character   1st Qu.:35.42   1st Qu.: -92.41  
 Mode  :character   Mode  :character   Median :39.24   Median : -86.56  
                                       Mean   :38.33   Mean   : -86.93  
                                       3rd Qu.:41.22   3rd Qu.: -82.31  
                                       Max.   :46.91   Max.   : -67.43  
  eclipse_1         eclipse_2         eclipse_3         eclipse_4       
 Length:3330       Length:3330       Length:3330       Length:3330      
 Class1:hms        Class1:hms        Class1:hms        Class1:hms       
 Class2:difftime   Class2:difftime   Class2:difftime   Class2:difftime  
 Mode  :numeric    Mode  :numeric    Mode  :numeric    Mode  :numeric   
                                                                        
                                                                        
  eclipse_5         eclipse_6       
 Length:3330       Length:3330      
 Class1:hms        Class1:hms       
 Class2:difftime   Class2:difftime  
 Mode  :numeric    Mode  :numeric   
                                    
                                    
# Look at names of variables
names(eclipse_annular_2023)
 [1] "state"     "name"      "lat"       "lon"       "eclipse_1" "eclipse_2"
 [7] "eclipse_3" "eclipse_4" "eclipse_5" "eclipse_6"
names(eclipse_partial_2023)
[1] "state"     "name"      "lat"       "lon"       "eclipse_1" "eclipse_2"
[7] "eclipse_3" "eclipse_4" "eclipse_5"
names(eclipse_partial_2024)
[1] "state"     "name"      "lat"       "lon"       "eclipse_1" "eclipse_2"
[7] "eclipse_3" "eclipse_4" "eclipse_5"
names(eclipse_total_2024)
 [1] "state"     "name"      "lat"       "lon"       "eclipse_1" "eclipse_2"
 [7] "eclipse_3" "eclipse_4" "eclipse_5" "eclipse_6"
# Look at structure of data sets
str(eclipse_annular_2023)
spc_tbl_ [811 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ state    : chr [1:811] "AZ" "AZ" "AZ" "AZ" ...
 $ name     : chr [1:811] "Chilchinbito" "Chinle" "Del Muerto" "Dennehotso" ...
 $ lat      : num [1:811] 36.5 36.2 36.2 36.8 35.7 ...
 $ lon      : num [1:811] -110 -110 -109 -110 -109 ...
 $ eclipse_1: 'hms' num [1:811] 15:10:50 15:11:10 15:11:20 15:10:50 ...
  ..- attr(*, "units")= chr "secs"
 $ eclipse_2: 'hms' num [1:811] 15:56:20 15:56:50 15:57:00 15:56:20 ...
  ..- attr(*, "units")= chr "secs"
 $ eclipse_3: 'hms' num [1:811] 16:30:29 16:31:21 16:31:13 16:29:50 ...
  ..- attr(*, "units")= chr "secs"
 $ eclipse_4: 'hms' num [1:811] 16:33:31 16:34:06 16:34:31 16:34:07 ...
  ..- attr(*, "units")= chr "secs"
 $ eclipse_5: 'hms' num [1:811] 17:09:40 17:10:30 17:10:40 17:09:40 ...
  ..- attr(*, "units")= chr "secs"
 $ eclipse_6: 'hms' num [1:811] 18:02:10 18:03:20 18:03:30 18:02:00 ...
  ..- attr(*, "units")= chr "secs"
 - attr(*, "spec")=
  .. cols(
  ..   state = col_character(),
  ..   name = col_character(),
  ..   lat = col_double(),
  ..   lon = col_double(),
  ..   eclipse_1 = col_time(format = ""),
  ..   eclipse_2 = col_time(format = ""),
  ..   eclipse_3 = col_time(format = ""),
  ..   eclipse_4 = col_time(format = ""),
  ..   eclipse_5 = col_time(format = ""),
  ..   eclipse_6 = col_time(format = "")
  .. )
 - attr(*, "problems")=<externalptr> 
str(eclipse_partial_2023)
spc_tbl_ [31,363 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ state    : chr [1:31363] "AL" "AL" "AL" "AL" ...
 $ name     : chr [1:31363] "Abanda" "Abbeville" "Adamsville" "Addison" ...
 $ lat      : num [1:31363] 33.1 31.6 33.6 34.2 32.9 ...
 $ lon      : num [1:31363] -85.5 -85.3 -87 -87.2 -87.7 ...
 $ eclipse_1: 'hms' num [1:31363] 15:41:20 15:42:30 15:38:20 15:37:50 ...
  ..- attr(*, "units")= chr "secs"
 $ eclipse_2: 'hms' num [1:31363] 16:23:30 16:25:50 16:20:50 16:19:50 ...
  ..- attr(*, "units")= chr "secs"
 $ eclipse_3: 'hms' num [1:31363] 17:11:10 17:13:50 17:07:50 17:06:50 ...
  ..- attr(*, "units")= chr "secs"
 $ eclipse_4: 'hms' num [1:31363] 18:00:00 18:03:10 17:56:30 17:55:10 ...
  ..- attr(*, "units")= chr "secs"
 $ eclipse_5: 'hms' num [1:31363] 18:45:10 18:49:30 18:42:10 18:40:30 ...
  ..- attr(*, "units")= chr "secs"
 - attr(*, "spec")=
  .. cols(
  ..   state = col_character(),
  ..   name = col_character(),
  ..   lat = col_double(),
  ..   lon = col_double(),
  ..   eclipse_1 = col_time(format = ""),
  ..   eclipse_2 = col_time(format = ""),
  ..   eclipse_3 = col_time(format = ""),
  ..   eclipse_4 = col_time(format = ""),
  ..   eclipse_5 = col_time(format = "")
  .. )
 - attr(*, "problems")=<externalptr> 
str(eclipse_partial_2024)
spc_tbl_ [28,844 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ state    : chr [1:28844] "AL" "AL" "AL" "AL" ...
 $ name     : chr [1:28844] "Abanda" "Abbeville" "Adamsville" "Addison" ...
 $ lat      : num [1:28844] 33.1 31.6 33.6 34.2 32.9 ...
 $ lon      : num [1:28844] -85.5 -85.3 -87 -87.2 -87.7 ...
 $ eclipse_1: 'hms' num [1:28844] 17:43:00 17:41:40 17:41:00 17:41:30 ...
  ..- attr(*, "units")= chr "secs"
 $ eclipse_2: 'hms' num [1:28844] 18:24:10 18:21:40 18:23:10 18:24:10 ...
  ..- attr(*, "units")= chr "secs"
 $ eclipse_3: 'hms' num [1:28844] 19:02:00 19:00:30 19:00:00 19:00:30 ...
  ..- attr(*, "units")= chr "secs"
 $ eclipse_4: 'hms' num [1:28844] 19:39:20 19:38:50 19:36:40 19:36:40 ...
  ..- attr(*, "units")= chr "secs"
 $ eclipse_5: 'hms' num [1:28844] 20:18:50 20:17:20 20:17:30 20:18:00 ...
  ..- attr(*, "units")= chr "secs"
 - attr(*, "spec")=
  .. cols(
  ..   state = col_character(),
  ..   name = col_character(),
  ..   lat = col_double(),
  ..   lon = col_double(),
  ..   eclipse_1 = col_time(format = ""),
  ..   eclipse_2 = col_time(format = ""),
  ..   eclipse_3 = col_time(format = ""),
  ..   eclipse_4 = col_time(format = ""),
  ..   eclipse_5 = col_time(format = "")
  .. )
 - attr(*, "problems")=<externalptr> 
str(eclipse_total_2024)
spc_tbl_ [3,330 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ state    : chr [1:3330] "AR" "AR" "AR" "AR" ...
 $ name     : chr [1:3330] "Acorn" "Adona" "Alexander" "Alicia" ...
 $ lat      : num [1:3330] 34.6 35 34.6 35.9 35.4 ...
 $ lon      : num [1:3330] -94.2 -92.9 -92.5 -91.1 -93.7 ...
 $ eclipse_1: 'hms' num [1:3330] 17:30:40 17:33:20 17:33:20 17:37:30 ...
  ..- attr(*, "units")= chr "secs"
 $ eclipse_2: 'hms' num [1:3330] 18:15:50 18:18:30 18:18:30 18:22:40 ...
  ..- attr(*, "units")= chr "secs"
 $ eclipse_3: 'hms' num [1:3330] 18:47:35 18:50:08 18:51:09 18:54:29 ...
  ..- attr(*, "units")= chr "secs"
 $ eclipse_4: 'hms' num [1:3330] 18:51:37 18:54:22 18:53:38 18:58:05 ...
  ..- attr(*, "units")= chr "secs"
 $ eclipse_5: 'hms' num [1:3330] 19:23:40 19:26:10 19:26:20 19:29:50 ...
  ..- attr(*, "units")= chr "secs"
 $ eclipse_6: 'hms' num [1:3330] 20:08:30 20:10:50 20:11:10 20:14:10 ...
  ..- attr(*, "units")= chr "secs"
 - attr(*, "spec")=
  .. cols(
  ..   state = col_character(),
  ..   name = col_character(),
  ..   lat = col_double(),
  ..   lon = col_double(),
  ..   eclipse_1 = col_time(format = ""),
  ..   eclipse_2 = col_time(format = ""),
  ..   eclipse_3 = col_time(format = ""),
  ..   eclipse_4 = col_time(format = ""),
  ..   eclipse_5 = col_time(format = ""),
  ..   eclipse_6 = col_time(format = "")
  .. )
 - attr(*, "problems")=<externalptr> 
#Look at number of rows and columns in each data set
nrow(eclipse_annular_2023)
[1] 811
nrow(eclipse_partial_2023)
[1] 31363
nrow(eclipse_partial_2024)
[1] 28844
nrow(eclipse_total_2024)
[1] 3330
ncol(eclipse_annular_2023)
[1] 10
ncol(eclipse_partial_2023)
[1] 9
ncol(eclipse_partial_2024)
[1] 9
ncol(eclipse_total_2024)
[1] 10

From the first looks, we see that there are four data sets with two that contain data from an annular eclipse in 2023 and two that contain data from a total eclipse in 2024. The variables captured are the location of the eclipse and time of day, capturing the state, name of the city, longitude and latitude. The five or six eclipse variables are the time of day at which the which the moon contacts the sun at the location at various points of the eclipse. For example, eclipse_3 is time at which annularity begins in this location in 2023 and time at which totality begins in this location in 2024.

Feature Engineering

I added a column for duration of visibility in minutes for all solar eclipses from first to last contact.

# Duration of the eclipse to total eclipse of 2024
eclipse_total_2024<- eclipse_total_2024 %>%
  mutate(
    eclipse_1_time = hms(eclipse_1),
    eclipse_6_time = hms(eclipse_6),
    duration = as.numeric(eclipse_6_time - eclipse_1_time)/60 )

# Duration of the eclipse to annular eclipse of 2023
eclipse_annular_2023<- eclipse_annular_2023 %>%
  mutate(
    eclipse_1_time = hms(eclipse_1),
    eclipse_6_time = hms(eclipse_6),
    duration = as.numeric(eclipse_6_time - eclipse_1_time)/60 )

# Duration of the eclipse to partial eclipse of 2024
eclipse_partial_2024<- eclipse_partial_2024 %>%
  mutate(
    eclipse_1_time = hms(eclipse_1),
    eclipse_5_time = hms(eclipse_5),
    duration = as.numeric(eclipse_5_time - eclipse_1_time)/60 )

# Duration of the eclipse to the partial eclipse of 2024
eclipse_partial_2023<- eclipse_partial_2023 %>%
  mutate(
    eclipse_1_time = hms(eclipse_1),
    eclipse_5_time = hms(eclipse_5),
    duration = as.numeric(eclipse_5_time - eclipse_1_time)/60 )

I will add a column of eclipse year in each of the data sets so that each observation can be identified as to which year the eclipse was from and also another column of eclipse type for the purpose of plotting.

# Identifier to the Total Eclipse 2024 data 
eclipse_total_2024 <- mutate(eclipse_total_2024, eclipse_type='Total_2024', eclipse_year='2024')

# Identifier  to the Annular Eclipse 2023 data 
eclipse_annular_2023 <- mutate(eclipse_annular_2023, eclipse_type='Annular_2023', eclipse_year = '2023')

# Identifier to the Partial Eclipse 2024 data 
eclipse_partial_2024 <- mutate(eclipse_partial_2024, eclipse_type='Partial_2024',eclipse_year='2024')

# Identifier to the Partial Eclipse 2023 data 
eclipse_partial_2023 <- mutate(eclipse_partial_2023, eclipse_type='Partial_2023',eclipse_year='2023')

Now, I will merge all of the datasets into one data set with all the 4 datasets by rows and kept state, city name, lattitude, longitude, duration and eclipse year in the final data. I converted the eclipse_year to a factor variable.

#Combining all the data sets by row
eclipse_all<- bind_rows(eclipse_total_2024, eclipse_annular_2023, eclipse_partial_2024,eclipse_partial_2023 )%>%
  #Selecting relevant columns
  select(state, name, lat, lon, duration, eclipse_year, eclipse_type)%>%
  #convert to factor
  mutate(eclipse_year=factor(eclipse_year))

Visualization

I will look further into how this data works is associated using graphs.

Scatter Plot

# Scatter plot of eclipse duration by latitude
ggplot(eclipse_all, aes(x = lon, y = duration, color = eclipse_type)) +
  geom_point() +
  labs(title = "Eclipse Duration by Longitude",
       x = "Longitude",
       y = "Duration (minutes)") +
  scale_color_manual(values = c("Total_2024" = "blue", "Annular_2023" = "green", "Partial_2024" = "orange", "Partial_2023" = "red")) +
  theme_minimal()

Map

# Define color palette for each eclipse type
eclipse_colors <- c("Partial_2024" = "orange", "Partial_2023" = "red", "Annular_2023" = "green", "Total_2024" = "blue")

# Create a leaflet map
map1 <- leaflet() %>%
  addTiles() %>%  # Add default OpenStreetMap tiles as the base layer
  setView(lng = -95.7129, lat = 37.0902, zoom = 2)  # Set initial view to focus on the world

# Create a leaflet map
map1 <- leaflet() %>%
  addTiles() %>%  # Add default OpenStreetMap tiles as the base layer
  setView(lng = -95.7129, lat = 37.0902, zoom = 2)  # Set initial view to focus on the world

# Add eclipse locations as markers to the map
map1 <- map1 %>%
   addCircleMarkers(data = eclipse_all, lng = ~lon, lat = ~lat,
                   radius = 5, color = ~eclipse_colors[eclipse_type],
                   popup = ~paste("Location: ", name, "<br/>", "Eclipse Type: ", eclipse_type, "<br/>", "Duration (minutes): ", duration))

# Display the map
map1
Input to asJSON(keep_vec_names=TRUE) is a named vector. In a future version of jsonlite, this option will not be supported, and named vectors will be translated into arrays instead of objects. If you want JSON object output, please use a named list instead. See ?toJSON.
Input to asJSON(keep_vec_names=TRUE) is a named vector. In a future version of jsonlite, this option will not be supported, and named vectors will be translated into arrays instead of objects. If you want JSON object output, please use a named list instead. See ?toJSON.

Histogram

# Histograms of eclipse duration for each year
ggplot(eclipse_all, aes(x = duration)) +
  geom_histogram(bins = 50, fill = "skyblue", color = "black") +  # Adjust bins as needed
  facet_wrap(~ eclipse_type, scales = "free_y") +  # Free y scales if counts vary significantly
  labs(title = "Duration by Eclipse Year and Type",
       x = "Duration",
       y = "Count") +
  theme_minimal()

Box Plot

# Box plot of duration by eclipse year
ggplot(eclipse_all, aes(x = eclipse_year, y = duration, fill = eclipse_year)) + 
  geom_boxplot() + 
  labs(title = "Distribution of duration by year of eclipse", x = "year", y = "Duration") +
  theme_minimal()

Model Building

I created three models using Cross-Validation to predict eclipse lengths:

Linear Model; Random Forest ; Decision Tree

First the data was randomly split into 80% train and 20% test sets with the models trained on the former.

# Set random seed for reproducibility
rngseed = 345
set.seed(rngseed)

# Assign 80% of the data into the training set
eclipsdata_split <- initial_split(eclipse_all, prop = .80)

# Create data frames for the train and test data
eclipstrain_data <- training(eclipsdata_split)
eclipstest_data <- testing(eclipsdata_split)
cv_fold <- vfold_cv(eclipstrain_data, v=10)

Define Each Model

# Linear Model
eclin_model <- linear_reg()%>%
  set_engine ("lm")%>%
  set_mode("regression")

# Random Forest Model
eclforest_model <- rand_forest()%>%
  set_engine("ranger", seed = rngseed)%>%
  set_mode("regression")

# Decision Tree Model
ecltree_model <- decision_tree()%>%
  set_engine("rpart")%>%
  set_mode("regression")

Workflow For All Models

# Workflow for Linear Model
eclin_wf <- workflow()%>%
  add_model(eclin_model)%>%
  add_formula(duration ~ eclipse_year)

# Workflow for Random Forest Model
eclforest_wf <- workflow()%>%
  add_model(eclforest_model)%>%
  add_formula(duration ~ eclipse_year)

# Workflow for Decision Tree Model
ecltree_wf <- workflow()%>%
  add_model(ecltree_model)%>%
  add_formula(duration ~ eclipse_year)

Fit The Models

# Define Resampling Control
resampling_control <- control_resamples(save_pred = TRUE)

# Linear Fit with CV
eclinfit_cv <- fit_resamples(eclin_wf, resamples=cv_fold, metrics = metric_set(rmse, rsq),control = resampling_control)

# Random forest Fit with CV
eclforestfit_cv <- fit_resamples (eclforest_wf, resamples = cv_fold, metrics = metric_set(rmse, rsq),control = resampling_control)

# Decision Tree Fit with CV
ecltreefit_cv <- fit_resamples(ecltree_wf, resamples = cv_fold, metrics = metric_set(rmse, rsq),control = resampling_control)

#Collecting Metrics
lin_metrics <-collect_metrics(eclinfit_cv)
RF_metrics <-collect_metrics(eclforestfit_cv)
DT_metrics <-collect_metrics(ecltreefit_cv)
Lin_predicts<- collect_predictions(eclinfit_cv)
Forest_predicts <- collect_predictions(eclforestfit_cv)
Tree_predicts <- collect_predictions(ecltreefit_cv)

I will choose the best model by comparing the RMSE metric, accuracy of the predicted vs observed value and residuals of the models.

Metrics

# Mean RMSE and R² for Each Model

## Linear
lin_rmse <- lin_metrics %>% filter(.metric == "rmse") %>% pull(mean)
lin_rsq <- lin_metrics %>% filter(.metric == "rsq") %>% pull(mean)

## Random Forest
RF_rmse <- RF_metrics %>% filter(.metric == "rmse") %>% pull(mean)
RF_rsq <- RF_metrics %>% filter(.metric == "rsq") %>% pull(mean)

## Decision Tree
DT_rmse <- DT_metrics %>% filter(.metric == "rmse") %>% pull(mean)
DT_rsq <- DT_metrics %>% filter(.metric == "rsq") %>% pull(mean)

## Metrics
cat("Linear Regression - RMSE:", lin_rmse, "R²:", lin_rsq, "\n")
Linear Regression - RMSE: 21.18793 R²: 0.1960076 
cat("Random Forest - RMSE:", RF_rmse, "R²:", RF_rsq, "\n")
Random Forest - RMSE: 21.18795 R²: 0.1960076 
cat("Decision Tree - RMSE:", DT_rmse, "R²:", DT_rsq, "\n")
Decision Tree - RMSE: 21.18793 R²: 0.1960076 

The linear and decision tree models perform marginally better than the random forest model based on the RMSE metric though the values are approximately similar.

Predictions

Below are the predicted values for the training data and residuals.

#Predicting on the training data
Lin_predicts<- collect_predictions(eclinfit_cv)
Forest_predicts <- collect_predictions(eclforestfit_cv)
Tree_predicts <- collect_predictions(ecltreefit_cv)

# Calculate Residuals
Lin_predicts <- Lin_predicts %>% mutate(residuals = .pred - duration)
Forest_predicts <- Forest_predicts %>% mutate(residuals = .pred - duration)
Tree_predicts <- Tree_predicts %>% mutate(residuals = .pred - duration)

# Labeling each data frame of predictions before combining
Lin_predicts$model <- "Linear Regression"
Forest_predicts$model <- "Random Forest"
Tree_predicts$model <- "Decision Tree"

# Combine all predictions into one dataframe
combine_predicts <- bind_rows(Lin_predicts, Forest_predicts, Tree_predicts)

Plotting

There are two plots:

Predicted vs Observed Values Plot A Residuals Plot.

# Predicted vs. Observed plot
ggplot(combine_predicts, aes(x = duration, y = .pred, color = model)) +
  geom_point(alpha = 0.6) +  
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "black") +  # 45-degree line
  labs(
    title = "Predicted vs. Observed Values",
    x = "Observed",
    y = "Predicted"
  ) +
  theme_minimal() +
  scale_color_manual(values = c("Decision Tree" = "orange", "Random Forest" = "purple", "Linear Regression" = "blue"))

# Residual Plot
ggplot(combine_predicts, aes(x = duration, y = residuals, color = model)) +
  geom_point(alpha = 0.6) +  # Adjust opacity with alpha if needed
  geom_hline(yintercept = 0,linetype = "dashed", color = "black") +  
  labs(
    title = "Residual Plot",
    x = "Observed",
    y = "Residuals"
  ) +
  theme_minimal() +
  scale_color_manual(values = c("Decision Tree" = "red", "Random Forest" = "violet", "Linear Regression" = "lightblue"))

The models failed to capture the predictor-response relationship, despite overlapping predictions and residuals indicated by the plots. I retained the linear model and extracted coefficients to interpret the eclipse year’s impact on duration

Model Selection

# Coefficients of Fitted Model
eclin_fit <- fit(eclin_wf, data=eclipstrain_data)
eclin_est <- extract_fit_parsnip(eclin_fit)
tidy(eclin_est)
# A tibble: 2 × 5
  term             estimate std.error statistic p.value
  <chr>               <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)         165.      0.132     1253.       0
2 eclipse_year2024    -20.9     0.187     -112.       0

The linear model indicates that, on average, eclipse durations in 2024 were shorter than those in 2023. Next, I’ll apply this model to the test data for predictions and performance evaluation. The model was trained on the complete training data set before making predictions on the test data.

Training

# Train the Workflow on Training Data
eclin_final <- fit(eclin_wf, eclipstrain_data)

# Make Predictions on Test Data
ecltest_predicts <- predict(eclin_final, new_data = eclipstest_data)

# Combine Predictions to Test data
eclipstest_data <- eclipstest_data %>%
  bind_cols(ecltest_predicts) 

# Calculate Residuals
eclipstest_data <- eclipstest_data%>%mutate(residuals= .pred - duration)

# Calculate Performance Metrics
eclipstest_data %>%
  metrics(truth = duration, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      20.9  
2 rsq     standard       0.197
3 mae     standard      11.9  

As we can see the RMSE metric of the model has a slightly higher value for the test data compared to the training data.

Final Plotting of Tested Data

# Observed vs. predicted plot
ggplot(eclipstest_data, aes(x = duration, y = .pred)) +
  geom_point(alpha = 0.6) +  
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "black") +  # 45-degree line
  labs(
    title = "Predicted vs. Observed Values",
    x = "Observed",
    y = "Predicted"
  ) +
  theme_minimal()

#Residual plot
ggplot(eclipstest_data, aes(x = duration, y = residuals)) +
  geom_point(alpha = 0.6) +  # Adjust opacity with alpha if needed
  geom_hline(yintercept = 0, linetype = "dashed", color = "black") +  
  labs(
    title = "Residual Plot",
    x = "Observed",
    y = "Residuals"
  ) +
  theme_minimal()

Conclusion

The Residual plot and the Predicted vs. Observed plots for the test set exhibited similar patterns to those observed in the training set. This consistency suggests that the model performs consistently on both sets and is unlikely to be over-fitting to the training data. However, the presence of systematic patterns in the residuals indicates that the model struggles to capture the underlying complexity of the data.

The analysis aimed to model eclipse duration based on the year, using data from the 2023 and 2024 eclipses sourced from the Tidy Tuesday GitHub repository. Initially, the hypothesis was that eclipses in 2023 had longer durations than those in 2024, based on preliminary exploratory data analysis. I compared Linear Regression, Random Forest, and Decision Tree models using cross-validation. Residuals from all models exhibited distinct patterns, suggesting that none of the models adequately captured the data’s complexity. Despite this, I opted for Linear Regression due to its simplicity and interpret-ability. The chosen model supported the hypothesis that 2023 eclipses had longer durations than 2024 eclipses. While the model performed consistently on both training and test data, its limitations were evident in residual patterns. To enhance model fit, future analysis could explore non-linear models or incorporate additional variables.